Analysis on Used Cars Data

Group members:

1. Aparna Subha

2. Divya A. Datla

3. Similoluwa Adelore

4. Saeed Irteza Haseem

5. Moiz Shaikh

6. Aditya Makhija

Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import sys
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import r2_score
from sklearn.svm import SVR
from scipy import stats
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor as rfr
import xgboost
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

Import Dataset

In [2]:
# reading the dataset from the CSV file
data = pd.read_csv("vehicles.csv")
data.head()#top 5 rows in the dataset
Out[2]:
id url region region_url price year manufacturer model condition cylinders ... drive size type paint_color image_url description county state lat long
0 7088746062 https://greensboro.craigslist.org/ctd/d/cary-2... greensboro https://greensboro.craigslist.org 10299 2012.0 acura tl NaN NaN ... NaN NaN other blue https://images.craigslist.org/01414_3LIXs9EO33... 2012 Acura TL Base 4dr Sedan Offered by: B... NaN nc 35.7636 -78.7443
1 7088745301 https://greensboro.craigslist.org/ctd/d/bmw-3-... greensboro https://greensboro.craigslist.org 0 2011.0 bmw 335 NaN 6 cylinders ... rwd NaN convertible blue https://images.craigslist.org/00S0S_1kTatLGLxB... BMW 3 Series 335i Convertible Navigation Dakot... NaN nc NaN NaN
2 7088744126 https://greensboro.craigslist.org/cto/d/greens... greensboro https://greensboro.craigslist.org 9500 2011.0 jaguar xf excellent NaN ... NaN NaN NaN blue https://images.craigslist.org/00505_f22HGItCRp... 2011 jaguar XF premium - estate sale. Retired ... NaN nc 36.1032 -79.8794
3 7088743681 https://greensboro.craigslist.org/ctd/d/cary-2... greensboro https://greensboro.craigslist.org 3995 2004.0 honda element NaN NaN ... fwd NaN SUV orange https://images.craigslist.org/00E0E_eAUnhFF86M... 2004 Honda Element LX 4dr SUV Offered by: ... NaN nc 35.7636 -78.7443
4 7074612539 https://lincoln.craigslist.org/ctd/d/gretna-20... lincoln https://lincoln.craigslist.org 41988 2016.0 chevrolet silverado k2500hd NaN NaN ... NaN NaN NaN NaN https://images.craigslist.org/00S0S_8msT7RQquO... Shop Indoors, Heated Showroom!!!www.gretnaauto... NaN ne 41.1345 -96.2458

5 rows × 25 columns

Data Preprocessing

In [3]:
data.info()#summary of the dataset features and values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 25 columns):
id              539759 non-null int64
url             539759 non-null object
region          539759 non-null object
region_url      539759 non-null object
price           539759 non-null int64
year            538772 non-null float64
manufacturer    516175 non-null object
model           531746 non-null object
condition       303707 non-null object
cylinders       321264 non-null object
fuel            536366 non-null object
odometer        440783 non-null float64
title_status    536819 non-null object
transmission    535786 non-null object
vin             315349 non-null object
drive           383987 non-null object
size            168550 non-null object
type            392290 non-null object
paint_color     365520 non-null object
image_url       539740 non-null object
description     539738 non-null object
county          0 non-null float64
state           539759 non-null object
lat             530785 non-null float64
long            530785 non-null float64
dtypes: float64(5), int64(2), object(18)
memory usage: 103.0+ MB

Dealing with missing values

In [4]:
# checking for missing values
data.isnull().sum()
Out[4]:
id                   0
url                  0
region               0
region_url           0
price                0
year               987
manufacturer     23584
model             8013
condition       236052
cylinders       218495
fuel              3393
odometer         98976
title_status      2940
transmission      3973
vin             224410
drive           155772
size            371209
type            147469
paint_color     174239
image_url           19
description         21
county          539759
state                0
lat               8974
long              8974
dtype: int64

Before we move on, we can see from the results that some features have very low or zero representation in the data, and some, like the id and url columns, are not needed for our analysis.

So we will drop them.

In [5]:
df_car = data.drop(['id','url','region_url','county',"description", 'image_url','title_status','vin'], axis = 1)# dropping unnecessary features

Examining categorical variables

As we noted earlier, most of the variables are categorical and the numerical ones have few or no missing values. It would be good to examine the categories so we can decide how to deal with the missing data.

In [6]:
categoricalColumn = ['region', 'year', 'manufacturer', 'model',
'condition','fuel','size','paint_color','state','transmission','drive','type']

for col in categoricalColumn:
    print('Value Counts for ' + col)
    print(df_car[col].value_counts())
    print('')
Value Counts for region
fayetteville                4131
springfield                 3985
rochester                   3819
columbus                    3589
jacksonville                3541
charleston                  3164
richmond                    3067
york                        2995
tri-cities                  2991
salem                       2991
bakersfield                 2990
ocala                       2990
kennewick-pasco-richland    2989
daytona beach               2989
omaha / council bluffs      2988
lakeland                    2987
grand rapids                2987
eugene                      2987
spokane / coeur d'alene     2987
sacramento                  2986
modesto                     2985
orlando                     2985
redding                     2984
ventura county              2984
akron / canton              2983
fort collins / north CO     2983
bellingham                  2983
greensboro                  2982
indianapolis                2981
tucson                      2981
                            ... 
mattoon-charleston            92
central louisiana             91
eastern montana               83
pierre / central SD           83
clovis / portales             81
northeast SD                  81
tuscarawas co                 79
oneonta                       79
kirksville                    72
sandusky                      70
la salle co                   69
eastern CO                    68
lake charles                  68
siskiyou county               67
southwest KS                  67
meridian                      64
owensboro                     62
meadville                     61
provo / orem                  60
statesboro                    60
logan                         50
north platte                  47
twin tiers NY/PA              46
west virginia (old)           44
southwest TX                  42
southwest MS                  38
susanville                    37
ogden-clearfield              16
st louis                       4
kansas city                    4
Name: region, Length: 403, dtype: int64

Value Counts for year
2017.0    40806
2016.0    39716
2015.0    38935
2013.0    37904
2014.0    37261
2012.0    35324
2011.0    32473
2008.0    29788
2007.0    27038
2010.0    25496
2018.0    24996
2006.0    22666
2009.0    20342
2005.0    18827
2019.0    17437
2004.0    15980
2003.0    12578
2002.0     9818
2001.0     7745
2000.0     6460
1999.0     5149
1998.0     3315
1997.0     2937
2020.0     2922
1995.0     1874
1996.0     1793
1994.0     1536
1993.0     1134
1991.0      827
1992.0      815
          ...  
1939.0       50
1932.0       49
1934.0       45
1936.0       40
1923.0       28
1935.0       24
1933.0       24
1928.0       23
1938.0       21
1918.0       17
1927.0       17
1926.0       13
1942.0       11
1925.0        7
1922.0        5
1943.0        4
1914.0        4
1912.0        4
1908.0        3
1924.0        2
1919.0        2
1917.0        1
1916.0        1
1915.0        1
1913.0        1
1909.0        1
1903.0        1
1901.0        1
1945.0        1
0.0           1
Name: year, Length: 113, dtype: int64

Value Counts for manufacturer
ford               98858
chevrolet          80608
toyota             40317
nissan             28546
honda              26077
jeep               25759
ram                25673
gmc                23908
dodge              18898
bmw                14709
hyundai            12748
mercedes-benz      11652
subaru             11581
volkswagen         10845
kia                 9712
chrysler            8802
cadillac            8265
buick               7442
lexus               6699
mazda               6347
audi                6011
infiniti            4416
acura               4108
pontiac             3774
lincoln             3666
volvo               3159
mitsubishi          2760
mini                2376
rover               1996
mercury             1847
saturn              1773
jaguar              1138
fiat                 926
tesla                268
harley-davidson      217
alfa-romeo            99
datsun                68
ferrari               58
aston-martin          28
land rover            25
porche                13
morgan                 2
hennessey              1
Name: manufacturer, dtype: int64

Value Counts for model
f-150                              11630
silverado 1500                      7302
1500                                6623
silverado                           5679
wrangler                            4270
accord                              3933
altima                              3920
2500                                3865
camry                               3746
grand cherokee                      3637
escape                              3590
civic                               3398
explorer                            3368
tacoma                              3203
sierra 1500                         3041
equinox                             3024
tahoe                               2908
silverado 2500hd                    2906
focus                               2891
mustang                             2817
fusion                              2808
malibu                              2784
impala                              2760
corolla                             2676
f-250                               2657
grand caravan                       2355
cr-v                                2277
3500                                2264
tundra                              2227
sonata                              2150
                                   ...  
1971 Corvette Stingray                 1
wrx premium 6m 6-speed manual          1
Kawasaki Mule  4010 Trans4x4           1
xc60 t5 platinum sport                 1
s80 v8                                 1
civic ex-l sport sedan                 1
gr. caravan, gt                        1
Willys 1/4 Ton M38a1                   1
IHC 4300-Cummins                       1
e-150 handicap van                     1
elantra base                           1
4500 hd chassis                        1
porsche C4S                            1
1500 4.7                               1
colorado crew cab 4wd                  1
Used Car Dealership Cars in Van        1
Monte Carlo SS                         1
altima hybird                          1
cummins 3500 dually                    1
benz ga 250 4matic                     1
challenger gt coupe                    1
elantra accent gls                     1
2500 savana 3dr cargo van              1
Freigthliner                           1
cooper 2012                            1
1987 Corvette convertible              1
rma 1500                               1
c10 short wide truck                   1
s430 4.3l                              1
f-250 powerstroke xlt 4x4              1
Name: model, Length: 36948, dtype: int64

Value Counts for condition
excellent    142619
good         119938
like new      29925
fair           8882
new            1611
salvage         732
Name: condition, dtype: int64

Value Counts for fuel
gas         469078
diesel       44258
other        17712
hybrid        4242
electric      1076
Name: fuel, dtype: int64

Value Counts for size
full-size      92545
mid-size       47033
compact        25014
sub-compact     3958
Name: size, dtype: int64

Value Counts for paint_color
white     95528
black     73986
silver    54442
blue      37221
red       36440
grey      34873
green      9580
custom     9237
brown      8245
yellow     2639
orange     2417
purple      912
Name: paint_color, dtype: int64

Value Counts for state
ca    55178
fl    39265
tx    30057
mi    23417
ny    23287
oh    20700
nc    19906
or    19835
pa    18556
wa    17042
tn    15490
wi    15272
va    14197
co    12580
il    12258
ia    10948
mt    10725
id    10430
ma    10298
mn    10269
nj    10228
sc     9604
az     9069
al     8528
ga     8344
in     7702
mo     7195
ct     6922
ar     6901
ks     6861
ok     6804
ky     6497
la     5977
md     5046
nm     4649
ak     3944
ne     3261
nv     3117
ms     2984
nh     2973
vt     2972
dc     2967
hi     2966
me     2963
ri     2962
sd     2283
wv     1727
de     1495
ut     1259
wy     1185
nd      664
Name: state, dtype: int64

Value Counts for transmission
automatic    475946
manual        34981
other         24859
Name: transmission, dtype: int64

Value Counts for drive
4wd    178592
fwd    133148
rwd     72247
Name: drive, dtype: int64

Value Counts for type
sedan          96119
SUV            94626
pickup         55934
truck          50419
coupe          21664
other          18351
hatchback      14791
wagon          12194
van            11006
convertible     9089
mini-van        6685
offroad          835
bus              577
Name: type, dtype: int64

In [7]:
df_cars = df_car.dropna() #creating a new dataframe without the na values
In [8]:
df_cars.isnull().sum()
Out[8]:
region          0
price           0
year            0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
drive           0
size            0
type            0
paint_color     0
state           0
lat             0
long            0
dtype: int64
In [9]:
df_cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105531 entries, 13 to 539752
Data columns (total 17 columns):
region          105531 non-null object
price           105531 non-null int64
year            105531 non-null float64
manufacturer    105531 non-null object
model           105531 non-null object
condition       105531 non-null object
cylinders       105531 non-null object
fuel            105531 non-null object
odometer        105531 non-null float64
transmission    105531 non-null object
drive           105531 non-null object
size            105531 non-null object
type            105531 non-null object
paint_color     105531 non-null object
state           105531 non-null object
lat             105531 non-null float64
long            105531 non-null float64
dtypes: float64(4), int64(1), object(12)
memory usage: 14.5+ MB

As seen above, the new dataframe has no missing values.

Why remove the missing data?

First, we have far more categorical variables than numerical ones, and filling the missing values by imputation may not be wise because it would skew the data. Although we lost a lot of rows, we still have plenty left for our analysis.
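As a toy sketch of that skew (hypothetical values, not rows from our dataset), mode imputation on a categorical column with many missing entries inflates the majority category:

```python
import pandas as pd

# Hypothetical fuel column: 4 observed values, 4 missing
s = pd.Series(["gas", "gas", "diesel", "hybrid", None, None, None, None])

observed_share = s.value_counts(normalize=True)["gas"]   # share among observed rows
imputed = s.fillna(s.mode()[0])                          # fill every NaN with the mode, "gas"
imputed_share = imputed.value_counts(normalize=True)["gas"]

print(observed_share, imputed_share)  # 0.5 before, 0.75 after
```

Half of the observed rows were "gas", but after imputation three quarters are, which is exactly the kind of distortion we want to avoid.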

In [10]:
# convert year column from float to int
df_cars['year']=df_cars['year'].astype(int)
In [11]:
df_cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 105531 entries, 13 to 539752
Data columns (total 17 columns):
region          105531 non-null object
price           105531 non-null int64
year            105531 non-null int32
manufacturer    105531 non-null object
model           105531 non-null object
condition       105531 non-null object
cylinders       105531 non-null object
fuel            105531 non-null object
odometer        105531 non-null float64
transmission    105531 non-null object
drive           105531 non-null object
size            105531 non-null object
type            105531 non-null object
paint_color     105531 non-null object
state           105531 non-null object
lat             105531 non-null float64
long            105531 non-null float64
dtypes: float64(3), int32(1), int64(1), object(12)
memory usage: 14.1+ MB

Exploratory Analysis on the dependent variable

The next step in our analysis is to examine the data a little more. After this we will do some further cleaning and preparation before modeling. For now, we keep the names in the categorical data for better visualization.

This examination helps us make more sense of the data using graphs, smaller tables, etc.

Let's take a look at our dependent variable, price.

In [12]:
df_cars['price'].value_counts()
Out[12]:
0        4241
1        1562
3500     1421
4500     1283
5500     1244
6995     1147
6500     1141
5995     1102
4995     1089
7995     1063
2500     1058
8995     1010
3995      974
7500      929
5000      912
9995      899
3000      833
8500      801
4000      777
10995     739
14995     722
13995     704
6000      688
12995     681
5900      671
9500      669
4900      612
2995      602
11995     602
1500      597
         ... 
26840       1
15596       1
23840       1
17699       1
12691       1
6294        1
24665       1
13741       1
11694       1
14290       1
16275       1
9327        1
405         1
6870        1
29275       1
725         1
8913        1
19100       1
10670       1
43518       1
12845       1
14956       1
7080        1
41599       1
2794        1
64010       1
469         1
4969        1
20895       1
19577       1
Name: price, Length: 4999, dtype: int64
In [13]:
df_cars.price.describe()# summary statistics of the price feature
Out[13]:
count    1.055310e+05
mean     1.189588e+05
std      1.849615e+07
min      0.000000e+00
25%      4.400000e+03
50%      7.999000e+03
75%      1.499500e+04
max      3.755744e+09
Name: price, dtype: float64
In [14]:
# Plot a histogram to see how varied the prices are
plt.hist(df_cars["price"])
Out[14]:
(array([1.05527e+05, 0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 2.00000e+00]),
 array([0.00000000e+00, 3.75574432e+08, 7.51148864e+08, 1.12672330e+09,
        1.50229773e+09, 1.87787216e+09, 2.25344659e+09, 2.62902102e+09,
        3.00459545e+09, 3.38016989e+09, 3.75574432e+09]),
 <a list of 10 Patch objects>)

Some of the prices are extremely high, so we will look at the distribution without the extreme values to see its skewness, peak, etc. before we standardize or normalize the data.

In [15]:
sns.boxplot(y='price', data=df_cars)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x2519700cc18>
In [16]:
plt.figure(figsize=(3,6))
sns.boxplot(y='price', data=df_cars,showfliers=False)#visualising the data without the outliers skewing
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x251970c2cf8>
In [17]:
sns.violinplot(y='price', data=df_cars)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x251992079b0>

From the boxplot and the violin plot we see that there are a lot of outliers, so the data is extremely skewed. The describe output shows the same: the maximum listed price runs into the billions.
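The skewness can also be quantified. As a sketch on hypothetical prices (not the actual column), pandas' `Series.skew()` shows how a single extreme listing dominates the statistic:

```python
import pandas as pd

# Hypothetical listing prices: a typical spread, then one absurd outlier
typical = pd.Series([3500, 4500, 5500, 6995, 8995, 12995])
with_outlier = pd.concat([typical, pd.Series([3_755_744_318])], ignore_index=True)

print(typical.skew())       # modest right skew
print(with_outlier.skew())  # skewness dominated by the single extreme listing
```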

Let's look into it more closely.

Top priced Cars

In [18]:
# Cars with the 20 highest values
top15 = df_cars.nlargest(20,"price")
top15.price
Out[18]:
141176    3755744318
243376    3755744309
363675    2521176519
21915     1234567890
209045     111111111
474592       5599500
152364        600000
277682        450000
133153        375000
428730        335000
255094        320025
121113        300000
472788        270000
371887        265000
434848        240000
490237        235000
369598        228841
30251         223000
348492        223000
288782        220000
Name: price, dtype: int64

From the results above we can see that:

  1. Ford has the highest price, at over 3 billion dollars, which is ridiculously high.
  2. After the Ram manufacturer, cars are generally priced at 2 million at most.
  3. The highest-priced cars are mostly automatic and in excellent condition.
  4. Only four cars are valued between 1 and 2 million dollars.

Least Priced Cars

In [19]:
# Cars with the 20 lowest prices
least15 = df_cars.nsmallest(20,"price")
least15.price
Out[19]:
221     0
334     0
597     0
613     0
623     0
873     0
1299    0
1542    0
1662    0
1728    0
1731    0
2165    0
2216    0
2608    0
3490    0
3492    0
3538    0
3553    0
3564    0
3584    0
Name: price, dtype: int64
In [20]:
df = df_cars[['manufacturer','price']][df_cars.price==df_cars['price'].min()]#the lowest prices of certain manufactures 
df.head()
Out[20]:
manufacturer price
221 nissan 0
334 ford 0
597 nissan 0
613 ford 0
623 toyota 0

4,241 of the roughly 106,000 rows have cars priced at zero and 1,562 are priced at one, so we will remove them and then check for other awkward cases to decide how to treat them.

In [21]:
# removing rows with prices of zero
df_cars.drop(df_cars[df_cars['price'] == 0 ].index,inplace = True)
In [22]:
# removing rows with prices of one
df_cars.drop(df_cars[df_cars['price'] == 1 ].index,inplace = True)
In [23]:
# The 1,554 lowest-priced cars remaining
least15 = df_cars.nsmallest(1554,"price")
least15.price
Out[23]:
449265       2
99633        3
102660       3
122807       3
397991       3
463521       3
175633       4
216880       4
391723       4
192515       5
382095       5
528714       5
324710       6
379023       8
391611       8
290532       9
410512       9
319521      10
528871      10
534101      10
344667      11
512821      11
72368       12
385387      12
445193      12
67973       13
377875      13
122707      14
280392      14
192040      15
          ... 
446297    1000
447588    1000
450459    1000
453007    1000
457269    1000
457273    1000
457525    1000
458153    1000
459159    1000
464707    1000
467549    1000
473732    1000
476477    1000
477956    1000
478395    1000
482330    1000
482802    1000
483205    1000
483291    1000
484330    1000
488267    1000
492831    1000
495383    1000
498025    1000
502190    1000
504598    1000
507441    1000
508503    1000
508966    1000
518647    1000
Name: price, Length: 1554, dtype: int64

We still have some unusual cases to deal with. We intend to limit the data to prices that are within 3 standard deviations of the mean.

Treating Outliers (Unusual Cases)

Exploration on Price

In [24]:
# flagging price outliers beyond 3 standard deviations
outliers = df_cars[(np.abs(stats.zscore(df_cars['price'])) >= 3 )]
outliers.sort_values('price',inplace=True)
print("Total Outlier observations :")
print(outliers.shape)
p_out_usd = round(  ((len(outliers)/len(df_cars))*100 ) , 2)
print("Proportion of Outliers : %s %%\n" % p_out_usd)
Total Outlier observations :
(5, 17)
Proportion of Outliers : 0.01 %

In [25]:
outliers_sub = df_cars[(np.abs(stats.zscore(df_cars['price'])) >=3 )]
df_cars = df_cars[ ~df_cars.index.isin(outliers_sub.index) ]
print("Observations after removing outliers (price) : ")
print(df_cars.shape)
Observations after removing outliers (price) : 
(99723, 17)

Using the z-score didn't remove all the outliers. From what we observed above, the prices after removal were still as high as 2 million dollars; beyond that, a few cars priced around a million dollars remain, so it would be good to remove those too. Looking at the lowest prices, we have cars listed for 3, 4, or 5 dollars and other unreasonable prices, even though they are reported to be new or in excellent condition. We intend to create lower and upper bounds of 1,000 and 150,000 dollars respectively.
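A toy illustration (hypothetical numbers, not our rows) of why the z-score rule misses cases like these: a single billion-dollar listing inflates the standard deviation so much that a million-dollar listing looks ordinary until the bigger one is removed:

```python
import numpy as np
from scipy import stats

# Hypothetical prices: 50 ordinary cars, one ~$1M listing, one ~$3.7B listing
prices = np.array([5000.0] * 50 + [1_000_000.0, 3_755_744_318.0])

z = np.abs(stats.zscore(prices))
print((z >= 3).sum())        # only the billion-dollar listing is flagged

rest = prices[prices < 1e9]  # drop it and re-score
z_rest = np.abs(stats.zscore(rest))
print((z_rest >= 3).sum())   # now the $1M listing is flagged too
```

This masking effect is why fixed lower and upper price bounds are the more robust cut here.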

In [26]:
df_cars = df_cars[df_cars['price'] > 1000] #creating lower and upper bounds
df_cars = df_cars[df_cars['price'] < 150000]# now df_cars only has price values within these limits
print('Min:' ,min(df_cars.price),'Max:',max(df_cars.price))
Min: 1001 Max: 145000

Exploration on Years

In [27]:
sns.distplot(df_cars.year)#visualising the year distribution density
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x251992690f0>

Our data includes cars from before 1990. We intend to remove them to improve price predictions, since the data is concentrated in the last 30 years.

In [28]:
(df_cars['year'] <1990).sum()# number of listings from before 1990
Out[28]:
2137

Only 2,137 observations are from before 1990. That is a tiny fraction of the total, so we can remove them.

In [29]:
df_cars1 = df_cars[(df_cars['year'] >1990)]
df_cars1.info()#info about cars listed after 1990
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95814 entries, 13 to 539752
Data columns (total 17 columns):
region          95814 non-null object
price           95814 non-null int64
year            95814 non-null int32
manufacturer    95814 non-null object
model           95814 non-null object
condition       95814 non-null object
cylinders       95814 non-null object
fuel            95814 non-null object
odometer        95814 non-null float64
transmission    95814 non-null object
drive           95814 non-null object
size            95814 non-null object
type            95814 non-null object
paint_color     95814 non-null object
state           95814 non-null object
lat             95814 non-null float64
long            95814 non-null float64
dtypes: float64(3), int32(1), int64(1), object(12)
memory usage: 12.8+ MB
In [30]:
# plotting the data from 1990 and above
plt.figure(figsize=(15,9))
ax = sns.countplot(x='year',data=df_cars1,palette = 'GnBu_d');
ax.set_ylabel('Cars Sold',fontsize=15)
ax.set_title('Number of cars sold in last 30 years',fontsize=15)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right",fontsize=10);

The data shows a dip in the 2009-2010 period, which could be attributed to the global slump in car sales caused by the recession.

Exploration on Odometer.

In [31]:
df_cars1['odometer'].describe() #summary of odometer feature
Out[31]:
count    9.581400e+04
mean     1.208595e+05
std      1.275137e+05
min      0.000000e+00
25%      7.449600e+04
50%      1.150000e+05
75%      1.557150e+05
max      1.000000e+07
Name: odometer, dtype: float64
In [32]:
(df_cars1['odometer']==0).sum()#cars with zero odometer reading
Out[32]:
148
In [33]:
# Odometer typical values; checking for outliers
sns.boxplot(y='odometer', data=df_cars1)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x251a3baafd0>

The data is heavily skewed as observed in the boxplot

In [34]:
#Visualizing using a densityplot
x = df_cars1['odometer']
sns.distplot(x, kde=False)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x251a390a2b0>
In [35]:
df_cars1 = df_cars1[(df_cars1['odometer'] >0)]#removing all the zero valued odometer readings
In [36]:
df_odometerlog  = np.log10(df_cars1['odometer'])#visualizing the distribution after log transformation
x = df_odometerlog
sns.distplot(x, kde=True);

The distribution is much closer to normal now.
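As a rough check (on hypothetical, lognormally distributed mileage values rather than the real column), the same `log10` transform takes the skewness from strongly positive to near zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed odometer readings
odo = pd.Series(rng.lognormal(mean=11.5, sigma=0.8, size=5000))

print(odo.skew())            # strongly right-skewed
print(np.log10(odo).skew())  # close to 0 after the log transform
```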

In [37]:
df_odometerlog = pd.DataFrame(df_odometerlog)
df_odometerlog.head()# converting the log-transformed values to a dataframe and viewing the top rows
Out[37]:
odometer
13 5.287914
28 4.929419
29 4.637670
35 5.190332
42 5.278754
In [38]:
df_cars1['odometer'] = df_odometerlog# mapping back to the original dataframe
df_cars1['odometer'].head()
Out[38]:
13    5.287914
28    4.929419
29    4.637670
35    5.190332
42    5.278754
Name: odometer, dtype: float64

Transforming the cylinder strings into numeric values ranging from 0 to 12

In [39]:
cylinder_num = {"cylinders": {"3 cylinders":3,"4 cylinders": 4, "5 cylinders": 5, "6 cylinders": 6,"8 cylinders": 8,
                             "10 cylinders" :10,"12 cylinders": 12,"other" :0 }}
cylinder_num
df_cars1.replace(cylinder_num, inplace=True)
df_cars1['cylinders'].value_counts()
Out[39]:
6     33563
8     31302
4     28763
5       929
10      760
0       187
3       127
12       35
Name: cylinders, dtype: int64
In [40]:
df_cars1['condition'].unique()#unique values
Out[40]:
array(['excellent', 'like new', 'good', 'fair', 'new', 'salvage'],
      dtype=object)
In [41]:
# mapping the condition ratings to the standard categories on the Kelley Blue Book website
df_cars1['condition'] = df_cars1["condition"].replace('like new', "fair")
df_cars1['condition'] = df_cars1["condition"].replace('new', "fair")
df_cars1['condition'] = df_cars1["condition"].replace('salvage', "poor")
In [42]:
df_cars1['condition'].value_counts()
Out[42]:
excellent    48093
good         32119
fair         15308
poor           146
Name: condition, dtype: int64
In [43]:
df_cars1.head()
Out[43]:
region price year manufacturer model condition cylinders fuel odometer transmission drive size type paint_color state lat long
13 denver 7995 2010 chevrolet silverado 1500 4wd excellent 8 gas 5.287914 automatic 4wd full-size truck white co 39.8302 -105.0370
28 greensboro 16000 2011 bmw 535i excellent 6 gas 4.929419 automatic fwd full-size sedan grey nc 35.5895 -82.5671
29 syracuse 10950 2011 buick lucerne cxl v6 excellent 6 gas 4.637670 automatic fwd full-size sedan red ny 43.1226 -76.1284
35 syracuse 4500 2012 ford fusion sel excellent 6 gas 5.190332 automatic 4wd mid-size sedan silver ny 43.4427 -76.5108
42 richmond 2800 2002 nissan xterra fair 6 gas 5.278754 automatic 4wd full-size SUV silver va 37.7202 -77.0998
In [44]:
df_cars1['fuel'].value_counts() #information on fuel types
Out[44]:
gas         86885
diesel       7683
hybrid        883
other         125
electric       90
Name: fuel, dtype: int64

Data Visualization

In [45]:
# Display the missing values
plt.figure(figsize=(12,12))
plt.title("Missing values for each column")
sns.heatmap(data.isnull())
plt.show()
Top 10 manufacturers
In [46]:
#plotting the manufacturers using bar plots
manufacturers_top10 = df_cars1['manufacturer'].value_counts().iloc[:10]
manufacturers = pd.DataFrame({'manufacturer': manufacturers_top10.index, 'count': manufacturers_top10.values})
plt.figure(figsize=(15,10))
ax = sns.barplot(y='manufacturer',x='count',data=manufacturers, order=manufacturers['manufacturer'],palette="cubehelix");
ax.set_yticklabels(ax.get_yticklabels(),rotation=90, ha="right",fontsize=10);

Ford is the most common manufacturer among US listings.

Top 10 years in car production
In [47]:
#Plotting top car production years
plt.figure(figsize=(15,10))
years_top10 = df_cars1['year'].value_counts().iloc[:10]
years = pd.DataFrame({'year': years_top10.index, 'count': years_top10.values})
ax = sns.barplot(x='year', y='count', data=years, order = years['year']);
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right",fontsize=8);
plt.title("Top 10 years car production",size = 25);

2013 is the year with the highest number of cars listed.

Condition of the vehicle
In [48]:
#Plotting the condition of the vehicles
cond_top = df_cars1['condition'].value_counts()
condition = pd.DataFrame({'Condition': cond_top.index, 'No.of vehicles': cond_top.values})
plt.figure(figsize=(15,8))
plt.title('Condition of the vehicles',size =25)
ax = sns.barplot(y='Condition',x='No.of vehicles',data=condition, order=condition['Condition'],palette='YlGnBu');
ax.set_yticklabels(ax.get_yticklabels(),fontsize=20);

The majority of the vehicles are in excellent condition

Fuel type Vs No.of vehicles
In [49]:
#Plotting fuel type against number of vehicles
fueltype = df_cars1['fuel'].value_counts()
fuel = pd.DataFrame({'Fuel': fueltype.index, 'count': fueltype.values})
plt.figure(figsize=(15,8))
plt.title('Fuel type Vs No.of vehicles',size =25)
ax = sns.barplot(x='Fuel',y='count',data=fuel, order=fuel['Fuel'],palette="YlGnBu");
ax.set_xticklabels(ax.get_xticklabels(), ha="right",fontsize=20);

Vehicles running on gasoline take up the lion's share of the market
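In pandas the shares follow directly from `value_counts(normalize=True)`; as a sanity check, the same percentage can be computed by hand from the counts shown earlier:

```python
from collections import Counter

# Counts copied from the fuel value_counts() output above
fuel_counts = Counter(gas=86885, diesel=7683, hybrid=883, other=125, electric=90)

total = sum(fuel_counts.values())
shares = {fuel: count / total for fuel, count in fuel_counts.items()}
print(f"gas share of listings: {shares['gas']:.1%}")  # 90.8%
```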

Price variation of cylinders
In [50]:
plt.figure (figsize = (12,5))
plt.ylabel ('Price',size =15)
plt.xlabel ('# of Cylinders',size=15)
plt.title('Price Variation of Cylinders',size=25)
plt.xticks(size=10)
plt.yticks(size=10)
Cylin_df = df_cars1.groupby('cylinders')['price'].mean()#average price per cylinder count
plt.plot(Cylin_df)
plt.show()
Price variation by transmission type
In [51]:
plt.figure (figsize = (12,5))
plt.ylabel ('Price',size =15)
plt.xlabel ('Transmission Type',size=15)
plt.title('Price Variation of Transmission Type',size=25)
plt.xticks(size=10)
plt.yticks(size=10)
trans_df = df_cars1.groupby('transmission')['price'].mean()#average price per transmission type
plt.plot(trans_df)
plt.show()
Price variation Vs type
In [52]:
plt.figure (figsize = (15,5))
plt.ylabel ('Price',size =15)
plt.xlabel ('Type',size=15)
plt.title('Price Variation Vs Type of vehicle',size=25)
plt.xticks(size=10)
plt.yticks(size=10)
type_df = df_cars1.groupby('type')['price'].mean()#average price per vehicle type
plt.plot(type_df)
plt.show()
Price variation Vs Manufacturer
In [53]:
plt.figure (figsize = (20,5))
plt.ylabel ('Price',size =15)
plt.xlabel ('Manufacturer',size=15)
plt.title('Price Variation Vs Manufacturer',size=25)
plt.xticks(size=10)
plt.yticks(size=10)
plt.xticks(rotation=45)
manu_df = df_cars1.groupby('manufacturer')['price'].mean()#average price per manufacturer
plt.plot(manu_df)
plt.show()
Visualization of price distribution of cars
In [54]:
pricecar = df_cars1[['year','price']]
pricechange = pricecar[pricecar['year']>=1990]
plt.scatter(pricechange['year'], pricechange['price'])
plt.title('Year vs Price')
plt.xlabel('Year')
plt.ylabel('Price ($)')
plt.show()
Visualization based on condition of Cars
In [55]:
plt.figure(figsize=(10,20))
labels = pd.DataFrame(df_cars["condition"].value_counts())
plt.pie(df_cars["condition"].value_counts(), labels = labels.index, autopct='%.2f')
plt.show()
plt.figure(figsize=(10,20))
labels = pd.DataFrame(df_cars["fuel"].value_counts())
plt.pie(df_cars["fuel"].value_counts(), labels = labels.index, autopct='%.2f')
plt.show()
plt.figure(figsize=(10,20))
labels = pd.DataFrame(df_cars["transmission"].value_counts())
plt.pie(df_cars["transmission"].value_counts(), labels = labels.index, autopct='%.2f')
plt.show()
Fuel Type and color of cars
In [56]:
gasLabels = data[data["fuel"]=="gas"].paint_color.value_counts().head(50).index
gasValues = data[data["fuel"]=="gas"].paint_color.value_counts().head(50).values
dieselLabels = data[data["fuel"]=="diesel"].paint_color.value_counts().head(50).index
dieselValues = data[data["fuel"]=="diesel"].paint_color.value_counts().head(50).values
electricLabels = data[data["fuel"]=="electric"].paint_color.value_counts().head(50).index
electricValues = data[data["fuel"]=="electric"].paint_color.value_counts().head(50).values

from plotly.subplots import make_subplots

# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=gasLabels, values=gasValues, name="Gas Car"),
              1, 1)
fig.add_trace(go.Pie(labels=dieselLabels, values=dieselValues, name="Diesel Car"),
              1, 2)
fig.add_trace(go.Pie(labels=electricLabels, values=electricValues, name="Electric Car"),
              1, 3)

fig.show()
Share of each cylinder type
In [57]:
cylindersframe = pd.DataFrame({"Cylinders":data.cylinders.value_counts().index,"Car_cylinders":data.cylinders.value_counts().values})
cylindersframe["Cylinders"] = cylindersframe["Cylinders"].astype(str)#cast cylinder labels to strings
cylindersframe.set_index("Cylinders",inplace=True)
p1 = [go.Pie(labels = cylindersframe.index,values = cylindersframe.Car_cylinders,hoverinfo="percent+label+value",hole=0.1,marker=dict(line=dict(color="#000000",width=2)))]
layout4 = go.Layout(title="Cylinders Pie Chart")
fig4 = go.Figure(data=p1,layout=layout4)
iplot(fig4)
Odometer values vs Manufacturers
In [58]:
data=data.sort_values(by=['odometer'],ascending=False)
plt.figure(figsize=(25,15))
sns.barplot(x=data.manufacturer, y=data.odometer)
plt.xticks(rotation= 90)
plt.xlabel('Manufacturer',fontsize = 25)
plt.ylabel('Odometer',fontsize = 25)
plt.show()
Manufacturer vs Price
In [59]:
plt.figure(figsize=(25,15))
sns.barplot(x=data.manufacturer , y=data.price)
plt.xticks(rotation= 90)
plt.xlabel('Manufacturer')
plt.ylabel('Price')
plt.show()
Vehicle count vs Manufacturer
In [60]:
plt.figure(figsize= (20,15))
plt.xlabel('Vehicle type',fontsize = 20)
plt.ylabel('Count',fontsize = 20)
plt.title("Types of vehicle",fontsize =20)
data.type.value_counts(dropna=False).plot(kind = "bar")
plt.show()
Car sales per state
In [61]:
statecount = df_cars1.state.value_counts()
statecount.index = statecount.index.map(str.upper)
datamap = dict(type='choropleth',
            colorscale = 'Reds',
            locations = statecount.index,
            z = statecount,
            locationmode = 'USA-states',
            marker = dict(line = dict(color = 'rgb(255,255,255)',width = 2)),
            colorbar = {'title':"Cars listed per State"}
            ) 

layout = dict(title = 'Cars listed per State',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )
choromap = go.Figure(data = [datamap],layout = layout)
iplot(choromap)

As we can see from the visualization above, California has the highest number of car sales, followed by Texas

Encoding categorical data

Now we convert the categorical variables to integers so that the machine learning models can work with them.
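`LabelEncoder` assigns each distinct category an integer code in sorted label order. A minimal pure-Python sketch of that mapping (toy fuel values for illustration):

```python
def label_encode(values):
    """Map each distinct label to an integer, assigned in sorted label order
    (mirrors the behaviour of sklearn's LabelEncoder.fit_transform)."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["gas", "diesel", "gas", "electric", "gas"])
print(mapping)  # {'diesel': 0, 'electric': 1, 'gas': 2}
print(codes)    # [2, 0, 2, 1, 2]
```

One caveat: the integer codes impose an arbitrary ordering on the categories, which tree-based models tolerate much better than linear models do.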

In [62]:
# the columns we select for further analysis
data_encode = df_cars1[['price','region','year','manufacturer','condition','cylinders','fuel', 'odometer', 'transmission', 'drive','size','type','paint_color','state']]
data_encode.head()#creating a dataframe with just the categorical features and ordinal features
Out[62]:
price region year manufacturer condition cylinders fuel odometer transmission drive size type paint_color state
13 7995 denver 2010 chevrolet excellent 8 gas 5.287914 automatic 4wd full-size truck white co
28 16000 greensboro 2011 bmw excellent 6 gas 4.929419 automatic fwd full-size sedan grey nc
29 10950 syracuse 2011 buick excellent 6 gas 4.637670 automatic fwd full-size sedan red ny
35 4500 syracuse 2012 ford excellent 6 gas 5.190332 automatic 4wd mid-size sedan silver ny
42 2800 richmond 2002 nissan fair 6 gas 5.278754 automatic 4wd full-size SUV silver va
In [63]:
# Determine the categorical features by separating them from the numeric ones
numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_columns = []
features = data_encode.columns.values.tolist()
for col in features:
    if data_encode[col].dtype in numerics: continue
    categorical_columns.append(col)
# Encoding categorical features
for col in categorical_columns:
    if col in data_encode.columns:
        le = LabelEncoder()
        le.fit(list(data_encode[col].astype(str).values))
        data_encode[col] = le.transform(list(data_encode[col].astype(str).values))
data_encode.head(10)#visualizing the encoded values
Out[63]:
price region year manufacturer condition cylinders fuel odometer transmission drive size type paint_color state
13 7995 80 2010 7 0 8 2 5.287914 0 0 1 10 10 5
28 16000 130 2011 4 0 6 2 4.929419 0 1 1 9 5 27
29 10950 352 2011 5 0 6 2 4.637670 0 1 1 9 8 34
35 4500 352 2012 12 0 6 2 5.190332 0 0 2 9 9 34
42 2800 287 2002 30 1 6 2 5.278754 0 0 1 0 9 45
48 25900 324 2008 12 0 8 2 4.863323 0 2 1 10 10 31
62 4999 130 2007 18 2 6 2 5.056035 0 2 2 9 3 27
63 8599 130 2015 9 2 4 2 4.585280 0 1 2 9 10 27
65 4800 130 2008 31 2 4 2 5.025785 0 1 3 3 9 27
66 27900 130 2015 19 2 6 2 4.780555 0 2 2 2 9 27

Now that the data is cleaned and encoded, we can run multiple machine learning algorithms to see which performs best

Machine Learning Techniques

Multiple Linear Regression

In [64]:
X= data_encode.drop('price',axis = 1)#dropping the target variables
X.head()
Out[64]:
region year manufacturer condition cylinders fuel odometer transmission drive size type paint_color state
13 80 2010 7 0 8 2 5.287914 0 0 1 10 10 5
28 130 2011 4 0 6 2 4.929419 0 1 1 9 5 27
29 352 2011 5 0 6 2 4.637670 0 1 1 9 8 34
35 352 2012 12 0 6 2 5.190332 0 0 2 9 9 34
42 287 2002 30 1 6 2 5.278754 0 0 1 0 9 45
In [65]:
y = data_encode.price#target variable
y.head()
Out[65]:
13     7995
28    16000
29    10950
35     4500
42     2800
Name: price, dtype: int64
In [66]:
X_train, X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state = 45)#splitting the data 80pc as training and 20 as testing
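The 80/20 split performed by `train_test_split` can be sketched in plain Python (a simplified version without stratification; the seed value 45 echoes the `random_state` above, but the shuffling itself differs from sklearn's):

```python
import random

def split_80_20(indices, seed=45):
    """Shuffle the indices and cut 80/20 (the idea behind train_test_split,
    simplified)."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train_idx, test_idx = split_80_20(range(100))
print(len(train_idx), len(test_idx))  # 80 20
```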
In [67]:
# fit to training set
regres = LinearRegression()#mapping the linear regression model
regres.fit(X_train, y_train)#fitting the model
Out[67]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [68]:
y_pred1 = regres.predict(X_test)#predicting on the testing data
In [69]:
print('Accuracy:' ,r2_score(y_test, y_pred1))
Accuracy: 0.5932785322798609
In [70]:
# Errors of the regression model
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred1))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred1))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))
Mean Absolute Error: 4205.851473125476
Mean Squared Error: 37957281.65560092
Root Mean Squared Error: 6160.948113367042
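The three error metrics above, and the R² score printed as "Accuracy", follow directly from their definitions; a small pure-Python sketch on toy predictions (illustrative numbers, not the model's output):

```python
import math

# Toy true values and predictions (illustrative only)
y_true = [7995.0, 16000.0, 10950.0, 4500.0, 2800.0]
y_pred = [9000.0, 14000.0, 11500.0, 5000.0, 2000.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae  = sum(abs(e) for e in errors) / n   # mean absolute error
mse  = sum(e * e for e in errors) / n    # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
mean_y = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 2), round(r2, 4))  # 971.0 1113.78 0.9441
```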

Ridge Regression

In [71]:
ridge1 = RidgeCV(alphas = [0.01,0.1,0.5,1.0,10.0,100], normalize = True)#mapping the ridge regressor with alpha values
ridge1.fit(X_train, y_train) #fitting the model            
y_pred = ridge1.predict(X_test)  #predicting values
# best alpha
print(ridge1.alpha_)
print('Accuracy:' ,r2_score( y_test, y_pred))#printing the accuracy scores
0.01
Accuracy: 0.5930470471728893
In [72]:
#Errors
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 4201.803342906198
Mean Squared Error: 37978885.003596425
Root Mean Squared Error: 6162.701112628814
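For intuition about what the alpha parameter does, here is the ridge solution in the simplest possible setting: one feature and no intercept, where the penalized least-squares coefficient is sum(x*y) / (sum(x²) + alpha). A hypothetical toy sketch (sklearn's RidgeCV additionally fits an intercept, handles many features, and normalizes):

```python
# Toy data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]

def ridge_beta(x, y, alpha):
    """One-feature, no-intercept ridge coefficient."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    # alpha = 0 recovers ordinary least squares; larger alpha shrinks beta toward 0
    print(alpha, round(ridge_beta(x, y, alpha), 4))
```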

Lasso Regression

In [73]:
lasso = LassoCV(alphas=[0.0001,0.001,0.01,0.1,0.5,1.0])#mapping the lasso regressor with input alpha values
lasso.fit(X_train,y_train)#fitting the model
# best alpha
print(lasso.alpha_)
y_pred1 = lasso.predict(X_test)
print('Accuracy:' ,r2_score( y_test, y_pred1))
0.01
Accuracy: 0.5932785640466395
In [74]:
#Error values
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred1))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred1))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))
Mean Absolute Error: 4205.8512948656635
Mean Squared Error: 37957278.69096626
Root Mean Squared Error: 6160.94787276814

Random Forest Regression

Finding the best number of estimators for random forest regression from a range of values for parameter tuning

In [75]:
estimators = [10,20,30,40,50,60,70,80,100,110,120]
mean_rfrs = []
std_rfrs_upper = []
std_rfrs_lower = []
np.random.seed(11111)
for i in estimators:
    model = rfr(n_estimators=i,max_depth=None)
    scores_rfr = cross_val_score(model,X,y,cv=3,scoring='explained_variance')
    print('estimators:',i)
#     print('explained variance scores for k=3 fold validation:',scores_rfr)
    print("Est. explained variance: %0.2f (+/- %0.2f)" % (scores_rfr.mean(), scores_rfr.std() * 2))
    print('')
    mean_rfrs.append(scores_rfr.mean())
    std_rfrs_upper.append(scores_rfr.mean()+scores_rfr.std()*2) # for error plotting
    std_rfrs_lower.append(scores_rfr.mean()-scores_rfr.std()*2) # for error plotting
estimators: 10
Est. explained variance: 0.81 (+/- 0.02)

estimators: 20
Est. explained variance: 0.82 (+/- 0.02)

estimators: 30
Est. explained variance: 0.82 (+/- 0.02)

estimators: 40
Est. explained variance: 0.82 (+/- 0.02)

estimators: 50
Est. explained variance: 0.82 (+/- 0.01)

estimators: 60
Est. explained variance: 0.82 (+/- 0.01)

estimators: 70
Est. explained variance: 0.82 (+/- 0.02)

estimators: 80
Est. explained variance: 0.83 (+/- 0.01)

estimators: 100
Est. explained variance: 0.83 (+/- 0.02)

estimators: 110
Est. explained variance: 0.83 (+/- 0.02)

estimators: 120
Est. explained variance: 0.83 (+/- 0.01)

In [76]:
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.plot(estimators,mean_rfrs,marker='o',
       linewidth=2,markersize=10)
ax.fill_between(estimators,std_rfrs_lower,std_rfrs_upper,
                facecolor='red',alpha=0.1,interpolate=True)
ax.set_ylim([0,1])
ax.set_xlim([0,250])
plt.title('Explained Variance of Random Forest Regressor')
plt.ylabel('Explained Variance')
plt.xlabel('Trees in Forest')
plt.grid()
plt.show()

From the plot above we can see that varying the number of trees has little effect on the explained-variance score of the model. Hence any value in the tested range is reasonable; we proceed with 110 trees

In [78]:
# with 110 trees
regressor = rfr(n_estimators = 110, random_state = 42)
regressor.fit(X_train,y_train)
Out[78]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=110, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)
In [79]:
y_pred = regressor.predict(X_test)
print('Accuracy:' ,r2_score( y_test, y_pred))
Accuracy: 0.857641848801272
In [80]:
# Errors
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 1992.191241724932
Mean Squared Error: 13285574.698847357
Root Mean Squared Error: 3644.9382297711654

Feature Importance

Eliminating irrelevant features helps the model perform better. Using backward elimination with the level of significance set at 0.01, we can filter the features and keep only the relevant ones

In [81]:
#Backward Elimination method for feature selection 
import statsmodels.api as sm
cols = list(X.columns)
pmax = 1
while (len(cols)>0):
    p= []
    X_1 = X[cols]
    X_1 = sm.add_constant(X_1)
    model = sm.OLS(y,X_1).fit()
    p = pd.Series(model.pvalues.values[1:],index = cols)      
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if(pmax>0.01):
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print(p.sort_values(ascending = False))
selected_features_BE
paint_color      1.825336e-03
size             3.571071e-04
type             1.288338e-07
state            6.556366e-10
manufacturer     2.226272e-74
condition        1.541276e-88
transmission    4.986922e-306
drive            0.000000e+00
odometer         0.000000e+00
fuel             0.000000e+00
cylinders        0.000000e+00
year             0.000000e+00
dtype: float64
Out[81]:
['year',
 'manufacturer',
 'condition',
 'cylinders',
 'fuel',
 'odometer',
 'transmission',
 'drive',
 'size',
 'type',
 'paint_color',
 'state']
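The elimination loop above can be distilled to a few lines; a pure-Python sketch using hypothetical p-values (in the real loop the OLS model is refit after every drop, so the remaining p-values change each iteration):

```python
# Hypothetical p-values (illustration only), keyed by feature name
pvalues = {"region": 0.25, "paint_color": 0.0018, "size": 0.00036, "year": 0.0}

def backward_eliminate(pvals, threshold=0.01):
    """Repeatedly drop the feature with the largest p-value until all
    remaining p-values fall at or below the threshold."""
    cols = dict(pvals)                    # working copy
    while cols:
        worst = max(cols, key=cols.get)   # feature with the largest p-value
        if cols[worst] > threshold:
            del cols[worst]               # drop it (the real loop refits here)
        else:
            break                         # every remaining feature is significant
    return list(cols)

selected = backward_eliminate(pvalues)
print(selected)  # ['paint_color', 'size', 'year']
```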
In [82]:
data_encode.head()
Out[82]:
price region year manufacturer condition cylinders fuel odometer transmission drive size type paint_color state
13 7995 80 2010 7 0 8 2 5.287914 0 0 1 10 10 5
28 16000 130 2011 4 0 6 2 4.929419 0 1 1 9 5 27
29 10950 352 2011 5 0 6 2 4.637670 0 1 1 9 8 34
35 4500 352 2012 12 0 6 2 5.190332 0 0 2 9 9 34
42 2800 287 2002 30 1 6 2 5.278754 0 0 1 0 9 45
In [83]:
#selecting the features retained after backward elimination
data_rfe = data_encode[['year',
 'manufacturer',
 'condition',
 'cylinders',
 'fuel',
 'odometer',
 'transmission',
 'drive',
 'size',
 'type',
 'paint_color',
 'state','price']]

Running the random forest regressor on the selected features

In [84]:
X = data_rfe.drop('price',axis = 1)#dropping target variable
y = data_rfe.price
X_train, X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state = 45)#split train test
In [85]:
regressor = rfr(n_estimators = 100, random_state = 42)#instantiating the regressor
regressor.fit(X_train,y_train)#running random forest regressor without the region feature
y_pred = regressor.predict(X_test)
print('Accuracy:',r2_score( y_test, y_pred))
Accuracy: 0.8586867344974106

Backward elimination dropped only the region feature. Running the model without it shows a slight increase over the previous score of 0.8576, but not a significant rise

XGBoosting Regression

In [87]:
# fit the model
xgb = XGBRegressor(random_state =42,silent=True)#mapping xgb regressor
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print('Accuracy:',r2_score( y_test, y_pred))
Accuracy: 0.7918443077041692

This model scores lower than the random forest
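For context on the technique itself: gradient boosting builds an additive model in which each new weak learner fits the residuals of the current ensemble. A minimal pure-Python sketch with depth-1 stumps on hypothetical toy data (illustrative of the idea behind XGBRegressor, not its actual algorithm, which adds regularization and second-order gradients):

```python
# Toy 1-D data (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 3.2, 2.9, 8.0, 8.3, 7.9]

def fit_stump(x, residuals):
    """Best single split on x: predict the residual mean on each side."""
    best = None
    for threshold in x:
        left  = [r for xi, r in zip(x, residuals) if xi < threshold]
        right = [r for xi, r in zip(x, residuals) if xi >= threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi < t else rm

pred = [sum(y) / len(y)] * len(y)   # start from the global mean
mse0 = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
lr = 0.5                            # learning rate (shrinkage)
for _ in range(20):                 # each round fits a stump to the residuals
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]

mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(round(mse0, 2), round(mse, 4))  # training error drops sharply
```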

Comparing the different model performances, we can conclude that random forest regression has the highest predictive score, with an R² of 0.8587

Cross Validation

To guard against overfitting, we performed 10-fold cross-validation on the random forest regressor, the model with the best score

In [89]:
scores_regressor = cross_val_score(regressor, X, y, cv=10)
print(scores_regressor.mean())
0.8384790186253068
In [90]:
"Est. explained variance: %0.2f (+/- %0.2f)" % (scores_regressor.mean(), scores_regressor.std() * 2)
Out[90]:
'Est. explained variance: 0.84 (+/- 0.05)'
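The k-fold splitting that underlies `cross_val_score` can be sketched in plain Python (a simplified version of sklearn's KFold without shuffling); the model is then fit on each training portion and scored on the held-out fold, and the k scores are averaged:

```python
def kfold_indices(n, k):
    """Split range(n) into k consecutive folds; the first n % k folds
    get one extra sample."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(10, 3)
for train, test in folds:
    print(test)   # each sample lands in exactly one test fold
```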

The cross-validation score implies that the random forest model generalizes well and is not overfitting

Conclusion

The reason for the underperformance of multiple linear regression is that the data is not linear. Since random forest captures non-linear interactions between the features and the target, it works well with our dataset, predicting the price of the used cars with an R² score of roughly 0.86.

References